Attacking Generalized Tree Alignment
Authors
Abstract
Many multiple alignment methods implicitly or explicitly try to minimize the amount of biological change implied by an alignment. At the level of sequences, biological change is measured along a phylogenetic tree, a structure that is frequently predicted only after the multiple alignment rather than together with it. The Generalized Tree Alignment problem addresses both questions simultaneously. It can formally be viewed as a Steiner tree problem in sequence space, and our approach merges a path heuristic for the construction of a Steiner tree with a clustering method of the kind usually applied only to distance data. This combination is achieved using sequence graphs, a data structure for the efficient representation of similar sequences. The method produces biologically meaningful answers, our experimental results seem promising, and a variant maintains a guaranteed error bound of (2 − 2/n) for n sequences.

Introduction

Phylogeny construction from molecular sequence data is a prominent application of the notion of a minimal Steiner tree [8, 3, 4]. One of the first formal versions of phylogeny construction interpreted the ancestral sequences as Steiner points in a hypercube over the letters A, C, G, T. Let a set of aligned sequences be given, together with a tree topology whose leaves are labeled with the sequences. For any assignment of sequences to the inner nodes, the length of an edge is defined as the number of mismatches between the sequences at the nodes incident to the edge. A parsimonious assignment of sequences to the inner nodes is one that minimizes the sum of the edge lengths. Finding this assignment has become known as the parsimony problem. An algorithm for its solution that is linear in the number of species and in the length of the sequences has been given by Fitch [2], and its correctness was proved by Hartigan [6]. In applying this formal framework to the construction of a phylogeny, one has to find the tree topology that gives rise to the most parsimonious tree.
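The parsimony computation for a fixed topology can be illustrated with a minimal sketch of the bottom-up pass usually attributed to Fitch. The tree encoding (nested pairs with aligned strings at the leaves), the rooted binary shape, and the names `fitch_site` and `parsimony_length` are assumptions made for this illustration and are not taken from the paper.

```python
from typing import Set, Tuple, Union

# Assumed encoding: a leaf is an aligned sequence (str); an inner node is a
# pair (left_subtree, right_subtree). The tree is rooted and binary.
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch_site(tree: Tree, site: int) -> Tuple[Set[str], int]:
    """Return (candidate letters at this node, mismatches below it) for one column."""
    if isinstance(tree, str):
        return {tree[site]}, 0
    left_set, left_cost = fitch_site(tree[0], site)
    right_set, right_cost = fitch_site(tree[1], site)
    common = left_set & right_set
    if common:
        # The children agree on at least one letter: no extra mismatch needed.
        return common, left_cost + right_cost
    # The children disagree: keep both options and count one mismatch.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_length(tree: Tree, n_sites: int) -> int:
    """Sum the per-column mismatch counts over all alignment columns."""
    return sum(fitch_site(tree, site)[1] for site in range(n_sites))


if __name__ == "__main__":
    grouped = (("ACGT", "ACGA"), ("TCGA", "TCGT"))
    crossed = (("ACGT", "TCGA"), ("ACGA", "TCGT"))
    print(parsimony_length(grouped, 4), parsimony_length(crossed, 4))
```

Each column is handled by one traversal of the tree, which matches the linear dependence on the number of species and on the sequence length stated above. Comparing the two toy topologies shows why the choice of topology matters for the total length.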
The problem of finding a most parsimonious tree can be put in graph theoretic terms. The graph to be studied has all possible sequences of a given length as its nodes. Edges are introduced between nodes whenever the corresponding sequences differ by exactly one mismatch. The most parsimonious tree is the optimal Steiner tree linking the nodes corresponding to the given sequences. Based on this view of the problem, finding the most parsimonious tree was shown to be NP-hard [1]. The formalization of the same problem without a given alignment is called generalized tree alignment [9] and has been shown to be MAX SNP-hard [9].

Work supported by DFG grant Vi–160/1–1.

Approximation algorithms

The interpretation of phylogeny construction as finding a minimal Steiner tree immediately suggests applying approximation algorithms for minimal Steiner trees [8] to the phylogeny and the phylogeny/alignment problems. Gusfield [5] suggested using the minimum spanning tree (MST) heuristic. The obvious drawback of the MST heuristic is that the tree it constructs usually has species as inner nodes. The algorithms we examine are derived from a class of Steiner tree approximation algorithms called path heuristics [8], which are themselves related to minimum spanning tree algorithms. Path heuristics start with the input sequences as trees, each consisting of a single node. Iteratively, a shortest possible path connecting two different trees is inserted. Methods differ in how they define the set of allowed “attachment points” where shortest paths connecting to another tree start. With only leaves allowed as attachment points, the procedure is equivalent to the classical minimum spanning tree algorithm of Kruskal [10]. Allowing more than just leaves (e.g. previously computed Steiner points) as attachment points permits better solutions in terms of tree length, topology, and biological meaningfulness.
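The leaves-only variant, which coincides with Kruskal's algorithm, can be sketched as follows. Unit-cost edit distance is used here as the path length between two unaligned sequences; that cost model and the names `edit_distance` and `mst_heuristic` are assumptions for illustration only. The sketch does not compute Steiner points, so it shows the baseline MST heuristic rather than the method proposed in the text.

```python
from itertools import combinations
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance (substitution, insertion, deletion each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def mst_heuristic(seqs: List[str]) -> List[Tuple[int, int, int]]:
    """Kruskal's algorithm on the complete graph of pairwise edit distances.

    Returns the tree edges as (index_i, index_j, distance)."""
    parent = list(range(len(seqs)))

    def find(x: int) -> int:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    pairs = sorted((edit_distance(seqs[i], seqs[j]), i, j)
                   for i, j in combinations(range(len(seqs)), 2))
    tree = []
    for dist, i, j in pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:          # shortest path joining two different trees
            parent[root_i] = root_j
            tree.append((i, j, dist))
    return tree


if __name__ == "__main__":
    for edge in mst_heuristic(["ACGT", "ACGA", "TCGA", "ACG"]):
        print(edge)
```

Every edge in the result joins two input sequences directly, which is the drawback noted above: input species end up as inner nodes of the tree.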
Note the resemblance of this procedure to hierarchical clustering methods commonly used in phylogeny reconstruction. These methods, too, merge two “closest” sets of sequences in an iterative fashion. The class of algorithms we describe here uses previously computed Steiner points as attachment nodes. For two sequences, an alignment describes a shortest path between the sequences and thus also a set of Steiner points. Unlike in Euclidean space, in sequence space there can be many different shortest paths, so a data structure is required to represent all resulting Steiner sequences. At this point, we employ the concept of a sequence graph introduced by Hein [7]. A sequence graph is a network (a directed acyclic graph, DAG) with edges annotated by letters or gaps. A sequence that is spelled by the letters on the edges of a path from the source to the sink node of the network is said to be represented by the sequence graph.
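A toy version of such a network can make the definition concrete. The sketch below builds a graph from a single pairwise alignment, with one node per column boundary and one edge per distinct symbol in each column, and then enumerates the sequences spelled along source-to-sink paths. The construction from one alignment, the adjacency-list encoding, and the names `graph_from_pairwise_alignment` and `represented_sequences` are simplifying assumptions for illustration; they are not Hein's construction.

```python
from typing import Dict, List, Set, Tuple

GAP = "-"

# Assumed encoding: node -> list of (edge label, successor node).
Edges = Dict[int, List[Tuple[str, int]]]


def graph_from_pairwise_alignment(row_a: str, row_b: str) -> Edges:
    """Build a chain-shaped DAG: column k becomes the edges from node k to k+1.

    Both rows are assumed to have the same length (gaps included)."""
    edges: Edges = {}
    for col, (x, y) in enumerate(zip(row_a, row_b)):
        edges[col] = [(symbol, col + 1) for symbol in sorted({x, y})]
    return edges


def represented_sequences(edges: Edges, source: int, sink: int) -> Set[str]:
    """Collect every sequence spelled by the letters on a source-to-sink path.

    Gap edges are traversed but contribute no letter."""
    results: Set[str] = set()
    stack = [(source, "")]
    while stack:
        node, prefix = stack.pop()
        if node == sink:
            results.add(prefix)
            continue
        for symbol, successor in edges.get(node, []):
            stack.append((successor, prefix if symbol == GAP else prefix + symbol))
    return results


if __name__ == "__main__":
    graph = graph_from_pairwise_alignment("ACGT", "A-GA")
    print(sorted(represented_sequences(graph, 0, 4)))
```

In this example the graph represents the two aligned sequences ACGT and AGA together with the intermediate sequences ACGA and AGT, each of which lies on a shortest path between the two inputs under unit costs.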
Similar resources
Approximation algorithms for constrained generalized tree alignment problem
In generalized tree alignment problem, we are given a set S of k biologically related sequences and we are interested in a minimum cost evolutionary tree for S. In many instances of this problem partial topology of the phylogenetic tree for S is known. In such instances, we would like to make use of this knowledge to restrict the tree topologies that we consider and construct a biologically rel...
Suffix Tree of Alignment: An Efficient Index for Similar Data
We consider an index data structure for similar strings. The generalized suffix tree can be a solution for this. The generalized suffix tree of two strings A and B is a compacted trie representing all suffixes in A and B. It has |A|+ |B| leaves and can be constructed in O(|A|+ |B|) time. However, if the two strings are similar, the generalized suffix tree is not efficient because it does not ex...
A Clustering Approach to Generalized Tree Alignment with Application to Alu Repeats
A formalization of the multiple sequence alignment problem that emphasizes the problem’s evolutionary aspect is the Generalized Tree Alignment Problem. Given a set of sequences, this formalization asks for a phylogenetic tree and ancestral sequences such that the implied amount of change necessary to explain the given data is minimal. The problem is computationally hard and we present a heurist...
Using an Extended Suffix Tree to Speed-up Sequence Alignment
An important problem in computational biology is the alignment of a given query sequence and sequences in a database to find similar (locally or globally) sequences from the database to the query. Many heuristic algorithms for this problem are based on the idea of locating a fixed-length matching pair of substrings (called a seed) to start an alignment, and then extending this alignment using d...
Suffix Array of Alignment: A Practical Index for Similar Data
The suffix tree of alignment is an index data structure for similar strings. Given an alignment of similar strings, it stores all suffixes of the alignment, called alignment-suffixes. An alignment-suffix represents one suffix of a string or suffixes of multiple strings starting at the same position in the alignment. The suffix tree of alignment makes good use of similarity in strings theoretica...
Publication date: 2007